Open-Meteo maintains an API for historical weather that allows for non-commercial usage of historical weather data maintained by the website.
This file builds on _v001, _v002, _v003, _v004, and _v005 to run exploratory analysis on some historical weather data.
The exploration process uses tidyverse, ranger, several generic custom functions, and several functions specific to Open Meteo processing. First, tidyverse, ranger, and the generic functions are loaded:
library(tidyverse) # tidyverse functionality is included throughout
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ranger) # predict() does not work on ranger objects unless ranger has been called
source("./Generic_Added_Utility_Functions_202105_v001.R") # Basic functions
Next, specific functions written in _v001, _v002, _v003, _v004, and _v005 are sourced:
source("./SimpleOneVar_Functions_202411_v001.R") # Functions for basic single variable analysis
source("./OpenMeteo_NextBest_202411_v001.R") # Functions for finding 'next best' predictor given existing model
source("./OpenMeteo_Functions_202411_v001.R") # Core functions for loading, processing, analysis of Open Meteo
source("./Generic_Analysis_Functions_202411_v001.R") # Additional functions for random forest and related analysis
Key mapping tables for available metrics are also copied:
hourlyMetrics <- "temperature_2m,relativehumidity_2m,dewpoint_2m,apparent_temperature,pressure_msl,surface_pressure,precipitation,rain,snowfall,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,shortwave_radiation,direct_radiation,direct_normal_irradiance,diffuse_radiation,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,windgusts_10m,et0_fao_evapotranspiration,weathercode,vapor_pressure_deficit,soil_temperature_0_to_7cm,soil_temperature_7_to_28cm,soil_temperature_28_to_100cm,soil_temperature_100_to_255cm,soil_moisture_0_to_7cm,soil_moisture_7_to_28cm,soil_moisture_28_to_100cm,soil_moisture_100_to_255cm"
dailyMetrics <- "weathercode,temperature_2m_max,temperature_2m_min,apparent_temperature_max,apparent_temperature_min,precipitation_sum,rain_sum,snowfall_sum,precipitation_hours,sunrise,sunset,windspeed_10m_max,windgusts_10m_max,winddirection_10m_dominant,shortwave_radiation_sum,et0_fao_evapotranspiration"
hourlyDescription <- "Air temperature at 2 meters above ground\nRelative humidity at 2 meters above ground\nDew point temperature at 2 meters above ground\nApparent temperature is the perceived feels-like temperature combining wind chill factor, relative humidity and solar radiation\nAtmospheric air pressure reduced to mean sea level (msl) or pressure at surface. Typically pressure on mean sea level is used in meteorology. Surface pressure gets lower with increasing elevation.\nAtmospheric air pressure reduced to mean sea level (msl) or pressure at surface. Typically pressure on mean sea level is used in meteorology. Surface pressure gets lower with increasing elevation.\nTotal precipitation (rain, showers, snow) sum of the preceding hour. Data is stored with a 0.1 mm precision. If precipitation data is summed up to monthly sums, there might be small inconsistencies with the total precipitation amount.\nOnly liquid precipitation of the preceding hour including local showers and rain from large scale systems.\nSnowfall amount of the preceding hour in centimeters. For the water equivalent in millimeter, divide by 7. E.g. 7 cm snow = 10 mm precipitation water equivalent\nTotal cloud cover as an area fraction\nLow level clouds and fog up to 2 km altitude\nMid level clouds from 2 to 6 km altitude\nHigh level clouds from 6 km altitude\nShortwave solar radiation as average of the preceding hour. This is equal to the total global horizontal irradiation\nDirect solar radiation as average of the preceding hour on the horizontal plane and the normal plane (perpendicular to the sun)\nDirect solar radiation as average of the preceding hour on the horizontal plane and the normal plane (perpendicular to the sun)\nDiffuse solar radiation as average of the preceding hour\nWind speed at 10 or 100 meters above ground. Wind speed on 10 meters is the standard level.\nWind speed at 10 or 100 meters above ground. Wind speed on 10 meters is the standard level.\nWind direction at 10 or 100 meters above ground\nWind direction at 10 or 100 meters above ground\nGusts at 10 meters above ground of the indicated hour. Wind gusts in CERRA are defined as the maximum wind gusts of the preceding hour. Please consult the ECMWF IFS documentation for more information on how wind gusts are parameterized in weather models.\nET0 Reference Evapotranspiration of a well watered grass field. Based on FAO-56 Penman-Monteith equations ET0 is calculated from temperature, wind speed, humidity and solar radiation. Unlimited soil water is assumed. ET0 is commonly used to estimate the required irrigation for plants.\nWeather condition as a numeric code. Follow WMO weather interpretation codes. See table below for details. Weather code is calculated from cloud cover analysis, precipitation and snowfall. As barely no information about atmospheric stability is available, estimation about thunderstorms is not possible.\nVapor Pressure Deificit (VPD) in kilopascal (kPa). For high VPD (>1.6), water transpiration of plants increases. For low VPD (<0.4), transpiration decreases\nAverage temperature of different soil levels below ground.\nAverage temperature of different soil levels below ground.\nAverage temperature of different soil levels below ground.\nAverage temperature of different soil levels below ground.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths."
dailyDescription <- "The most severe weather condition on a given day\nMaximum and minimum daily air temperature at 2 meters above ground\nMaximum and minimum daily air temperature at 2 meters above ground\nMaximum and minimum daily apparent temperature\nMaximum and minimum daily apparent temperature\nSum of daily precipitation (including rain, showers and snowfall)\nSum of daily rain\nSum of daily snowfall\nThe number of hours with rain\nSun rise and set times\nSun rise and set times\nMaximum wind speed and gusts on a day\nMaximum wind speed and gusts on a day\nDominant wind direction\nThe sum of solar radiaion on a given day in Megajoules\nDaily sum of ET0 Reference Evapotranspiration of a well watered grass field"
# Create tibble for hourly metrics
tblMetricsHourly <- tibble::tibble(metric=hourlyMetrics %>% str_split_1(","),
description=hourlyDescription %>% str_split_1("\n")
)
tblMetricsHourly %>%
print(n=50)
## # A tibble: 33 × 2
## metric description
## <chr> <chr>
## 1 temperature_2m Air temperature at 2 meters above ground
## 2 relativehumidity_2m Relative humidity at 2 meters above ground
## 3 dewpoint_2m Dew point temperature at 2 meters above ground
## 4 apparent_temperature Apparent temperature is the perceived feels-li…
## 5 pressure_msl Atmospheric air pressure reduced to mean sea l…
## 6 surface_pressure Atmospheric air pressure reduced to mean sea l…
## 7 precipitation Total precipitation (rain, showers, snow) sum …
## 8 rain Only liquid precipitation of the preceding hou…
## 9 snowfall Snowfall amount of the preceding hour in centi…
## 10 cloudcover Total cloud cover as an area fraction
## 11 cloudcover_low Low level clouds and fog up to 2 km altitude
## 12 cloudcover_mid Mid level clouds from 2 to 6 km altitude
## 13 cloudcover_high High level clouds from 6 km altitude
## 14 shortwave_radiation Shortwave solar radiation as average of the pr…
## 15 direct_radiation Direct solar radiation as average of the prece…
## 16 direct_normal_irradiance Direct solar radiation as average of the prece…
## 17 diffuse_radiation Diffuse solar radiation as average of the prec…
## 18 windspeed_10m Wind speed at 10 or 100 meters above ground. W…
## 19 windspeed_100m Wind speed at 10 or 100 meters above ground. W…
## 20 winddirection_10m Wind direction at 10 or 100 meters above ground
## 21 winddirection_100m Wind direction at 10 or 100 meters above ground
## 22 windgusts_10m Gusts at 10 meters above ground of the indicat…
## 23 et0_fao_evapotranspiration ET0 Reference Evapotranspiration of a well wat…
## 24 weathercode Weather condition as a numeric code. Follow WM…
## 25 vapor_pressure_deficit Vapor Pressure Deificit (VPD) in kilopascal (k…
## 26 soil_temperature_0_to_7cm Average temperature of different soil levels b…
## 27 soil_temperature_7_to_28cm Average temperature of different soil levels b…
## 28 soil_temperature_28_to_100cm Average temperature of different soil levels b…
## 29 soil_temperature_100_to_255cm Average temperature of different soil levels b…
## 30 soil_moisture_0_to_7cm Average soil water content as volumetric mixin…
## 31 soil_moisture_7_to_28cm Average soil water content as volumetric mixin…
## 32 soil_moisture_28_to_100cm Average soil water content as volumetric mixin…
## 33 soil_moisture_100_to_255cm Average soil water content as volumetric mixin…
# Create tibble for daily metrics
tblMetricsDaily <- tibble::tibble(metric=dailyMetrics %>% str_split_1(","),
description=dailyDescription %>% str_split_1("\n")
)
tblMetricsDaily
## # A tibble: 16 × 2
## metric description
## <chr> <chr>
## 1 weathercode The most severe weather condition on a given day
## 2 temperature_2m_max Maximum and minimum daily air temperature at 2 me…
## 3 temperature_2m_min Maximum and minimum daily air temperature at 2 me…
## 4 apparent_temperature_max Maximum and minimum daily apparent temperature
## 5 apparent_temperature_min Maximum and minimum daily apparent temperature
## 6 precipitation_sum Sum of daily precipitation (including rain, showe…
## 7 rain_sum Sum of daily rain
## 8 snowfall_sum Sum of daily snowfall
## 9 precipitation_hours The number of hours with rain
## 10 sunrise Sun rise and set times
## 11 sunset Sun rise and set times
## 12 windspeed_10m_max Maximum wind speed and gusts on a day
## 13 windgusts_10m_max Maximum wind speed and gusts on a day
## 14 winddirection_10m_dominant Dominant wind direction
## 15 shortwave_radiation_sum The sum of solar radiaion on a given day in Megaj…
## 16 et0_fao_evapotranspiration Daily sum of ET0 Reference Evapotranspiration of …
A previously existing hourly dataset is loaded, with key analysis variables defined in a vector:
# Load previous data
allCityHourly <- readFromRDS("allCity_20241116")
# Get core training variables
varsTrainHourly <- getVarsTrain(allCityHourly)
varsTrainHourly
## [1] "hour" "temperature_2m"
## [3] "relativehumidity_2m" "dewpoint_2m"
## [5] "apparent_temperature" "pressure_msl"
## [7] "surface_pressure" "precipitation"
## [9] "rain" "snowfall"
## [11] "cloudcover" "cloudcover_low"
## [13] "cloudcover_mid" "cloudcover_high"
## [15] "shortwave_radiation" "direct_radiation"
## [17] "direct_normal_irradiance" "diffuse_radiation"
## [19] "windspeed_10m" "windspeed_100m"
## [21] "winddirection_10m" "winddirection_100m"
## [23] "windgusts_10m" "et0_fao_evapotranspiration"
## [25] "weathercode" "vapor_pressure_deficit"
## [27] "soil_temperature_0_to_7cm" "soil_temperature_7_to_28cm"
## [29] "soil_temperature_28_to_100cm" "soil_temperature_100_to_255cm"
## [31] "soil_moisture_0_to_7cm" "soil_moisture_7_to_28cm"
## [33] "soil_moisture_28_to_100cm" "soil_moisture_100_to_255cm"
## [35] "year" "doy"
# Assign default label
keyLabel <- genericKeyLabelOM()
keyLabel
## [1] "predictions based on pre-2022 training data applied to 2022 holdout dataset"
# Get years of data available
allCityHourly %>%
mutate(year=year(date)) %>%
group_by(src) %>%
summarize(n=n(), minYear=min(year), maxYear=max(year), minDate=min(date), maxDate=max(date))
## # A tibble: 7 × 6
## src n minYear maxYear minDate maxDate
## <chr> <int> <dbl> <dbl> <date> <date>
## 1 Boston 122712 2010 2023 2010-01-01 2023-12-31
## 2 Chicago 122712 2010 2023 2010-01-01 2023-12-31
## 3 Houston 122712 2010 2023 2010-01-01 2023-12-31
## 4 LA 122712 2010 2023 2010-01-01 2023-12-31
## 5 Miami 122712 2010 2023 2010-01-01 2023-12-31
## 6 NYC 117936 2010 2023 2010-01-01 2023-06-15
## 7 Vegas 122712 2010 2023 2010-01-01 2023-12-31
A daily dataset is created from existing data:
omCityMap <- c("bos"="Boston MA",
"ord"="Chicago IL",
"hou"="Houston TX",
"lax"="Los Angeles CA",
"las"="Las Vegas NV",
"mia"="Miami FL",
"nyc"="New York NY"
)
allCityDaily <- map_dfr(.x=names(omCityMap),
.f=function(x) omRunAllSteps(dlData=FALSE, runStats=FALSE, abbCity=x, returnDF=TRUE),
.id="src"
) %>%
mutate(cityName=unname(omCityMap)[as.integer(src)])
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 71, 73, 73, 3, 3, 3, 1, 71, 1, 0, 3, 3, 2, …
## $ temperature_2m_max <dbl> 3.4, 0.5, -1.5, 0.0, -0.4, -0.7, 1.6, -2.0,…
## $ temperature_2m_min <dbl> -2.4, -5.6, -8.8, -5.0, -5.9, -5.3, -4.9, -…
## $ apparent_temperature_max <dbl> -0.2, -4.1, -8.5, -5.8, -5.2, -6.7, -3.6, -…
## $ apparent_temperature_min <dbl> -6.6, -13.9, -17.0, -10.9, -11.3, -10.9, -1…
## $ precipitation_sum <dbl> 0.5, 8.9, 6.7, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0…
## $ rain_sum <dbl> 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum <dbl> 0.49, 5.53, 5.04, 0.00, 0.00, 0.00, 0.00, 0…
## $ precipitation_hours <dbl> 4, 24, 20, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,…
## $ sunrise <dttm> 2010-01-01 07:13:00, 2010-01-02 07:13:00, …
## $ sunset <dttm> 2010-01-01 16:22:00, 2010-01-02 16:23:00, …
## $ windspeed_10m_max <dbl> 13.1, 34.7, 34.8, 28.1, 17.1, 20.7, 21.1, 2…
## $ windgusts_10m_max <dbl> 23.4, 64.8, 68.0, 50.8, 30.6, 38.9, 38.5, 3…
## $ winddirection_10m_dominant <int> 332, 334, 304, 304, 299, 302, 292, 319, 322…
## $ shortwave_radiation_sum <dbl> 6.58, 3.86, 4.04, 7.66, 7.55, 8.26, 8.82, 6…
## $ et0_fao_evapotranspiration <dbl> 0.66, 0.65, 0.69, 0.87, 0.76, 0.93, 1.01, 0…
## $ sunrise_chr <chr> "2010-01-01T07:13", "2010-01-02T07:13", "20…
## $ sunset_chr <chr> "2010-01-01T16:22", "2010-01-02T16:23", "20…
## $ fct_winddir <fct> 332, 334, 304, 304, 299, 302, 292, 319, 322…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 23,376
## Columns: 21
## $ date <date> 1960-01-01, 1960-01-02, 1960-01-03, 1960-0…
## $ time <chr> "1960-01-01", "1960-01-02", "1960-01-03", "…
## $ weathercode <fct> 3, 51, 3, 2, 3, 1, 3, 3, 51, 51, 61, 63, 53…
## $ temperature_2m_max <dbl> 1.6, 4.9, -0.6, -5.9, -6.1, 1.0, 4.9, 1.6, …
## $ temperature_2m_min <dbl> -3.5, -0.9, -7.5, -11.8, -11.2, -10.3, -3.2…
## $ apparent_temperature_max <dbl> -4.3, -0.6, -6.9, -13.1, -13.6, -4.4, -1.3,…
## $ apparent_temperature_min <dbl> -7.9, -7.0, -14.4, -18.4, -17.7, -17.9, -8.…
## $ precipitation_sum <dbl> 0.0, 1.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6…
## $ rain_sum <dbl> 0.0, 1.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6…
## $ snowfall_sum <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ precipitation_hours <dbl> 0, 11, 0, 0, 0, 0, 0, 0, 6, 1, 5, 24, 5, 10…
## $ sunrise <dttm> 1960-01-01 07:18:00, 1960-01-02 07:18:00, …
## $ sunset <dttm> 1960-01-01 16:29:00, 1960-01-02 16:30:00, …
## $ windspeed_10m_max <dbl> 22.7, 27.6, 27.0, 29.1, 29.6, 32.1, 32.7, 3…
## $ windgusts_10m_max <dbl> 40.0, 52.6, 44.6, 50.8, 48.2, 52.6, 56.5, 5…
## $ winddirection_10m_dominant <int> 142, 214, 268, 247, 261, 232, 234, 275, 185…
## $ shortwave_radiation_sum <dbl> 7.45, 2.25, 4.58, 8.66, 9.09, 8.79, 5.86, 8…
## $ et0_fao_evapotranspiration <dbl> 0.95, 0.66, 1.06, 1.06, 1.04, 1.33, 1.23, 1…
## $ sunrise_chr <chr> "1960-01-01T07:18", "1960-01-02T07:18", "19…
## $ sunset_chr <chr> "1960-01-01T16:29", "1960-01-02T16:30", "19…
## $ fct_winddir <fct> 142, 214, 268, 247, 261, 232, 234, 275, 185…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 3, 1, 3, 3, 0, 51, 55, 2, 0, 0, 1, 1, 53, 6…
## $ temperature_2m_max <dbl> 11.8, 12.0, 10.0, 7.6, 8.0, 12.7, 13.4, 0.8…
## $ temperature_2m_min <dbl> 3.9, 0.7, 4.4, 1.8, -1.9, -0.1, -0.2, -3.0,…
## $ apparent_temperature_max <dbl> 8.3, 8.8, 7.2, 4.5, 5.1, 10.9, 12.5, -5.7, …
## $ apparent_temperature_min <dbl> 1.0, -2.6, 1.2, -2.2, -5.5, -3.6, -6.8, -9.…
## $ precipitation_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 1.1, 6.4, 0.0, 0.0…
## $ rain_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 1.1, 6.4, 0.0, 0.0…
## $ snowfall_sum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours <dbl> 0, 0, 0, 0, 0, 4, 10, 0, 0, 0, 0, 0, 5, 17,…
## $ sunrise <dttm> 2010-01-01 08:17:00, 2010-01-02 08:17:00, …
## $ sunset <dttm> 2010-01-01 18:33:00, 2010-01-02 18:34:00, …
## $ windspeed_10m_max <dbl> 25.9, 11.2, 14.4, 21.6, 8.8, 17.4, 32.9, 22…
## $ windgusts_10m_max <dbl> 46.8, 23.4, 28.1, 42.1, 20.2, 33.8, 59.0, 4…
## $ winddirection_10m_dominant <int> 350, 89, 65, 344, 15, 126, 345, 345, 353, 7…
## $ shortwave_radiation_sum <dbl> 11.87, 13.86, 4.76, 13.82, 14.46, 4.55, 6.9…
## $ et0_fao_evapotranspiration <dbl> 1.69, 1.88, 1.03, 1.74, 1.55, 0.91, 1.20, 1…
## $ sunrise_chr <chr> "2010-01-01T08:17", "2010-01-02T08:17", "20…
## $ sunset_chr <chr> "2010-01-01T18:33", "2010-01-02T18:34", "20…
## $ fct_winddir <fct> 350, 89, 65, 344, 15, 126, 345, 345, 353, 7…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 61, 0, …
## $ temperature_2m_max <dbl> 20.1, 23.2, 23.0, 22.1, 22.9, 23.2, 23.3, 2…
## $ temperature_2m_min <dbl> 4.7, 6.7, 6.5, 6.5, 5.0, 7.7, 5.2, 8.4, 7.2…
## $ apparent_temperature_max <dbl> 17.5, 20.6, 19.9, 18.3, 19.8, 19.8, 20.2, 1…
## $ apparent_temperature_min <dbl> 0.9, 2.9, 2.7, 2.3, 0.8, 4.0, 1.6, 5.0, 3.5…
## $ precipitation_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ rain_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0…
## $ sunrise <dttm> 2010-01-01 07:59:00, 2010-01-02 08:00:00, …
## $ sunset <dttm> 2010-01-01 17:55:00, 2010-01-02 17:56:00, …
## $ windspeed_10m_max <dbl> 10.3, 13.0, 12.4, 11.4, 11.5, 9.8, 8.8, 12.…
## $ windgusts_10m_max <dbl> 23.0, 32.0, 31.3, 25.9, 24.8, 19.8, 18.4, 3…
## $ winddirection_10m_dominant <int> 7, 9, 20, 13, 9, 20, 22, 24, 17, 11, 26, 17…
## $ shortwave_radiation_sum <dbl> 8.99, 12.29, 11.80, 10.71, 12.54, 12.24, 12…
## $ et0_fao_evapotranspiration <dbl> 1.97, 2.67, 2.95, 2.74, 2.58, 2.60, 2.31, 2…
## $ sunrise_chr <chr> "2010-01-01T07:59", "2010-01-02T08:00", "20…
## $ sunset_chr <chr> "2010-01-01T17:55", "2010-01-02T17:56", "20…
## $ fct_winddir <fct> 7, 9, 20, 13, 9, 20, 22, 24, 17, 11, 26, 17…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 2, 0, 0, 1, 1, 1, 2, 1, 1, 2, 1, 2, 51, 1, …
## $ temperature_2m_max <dbl> 10.3, 14.2, 14.2, 13.3, 13.6, 15.8, 16.1, 1…
## $ temperature_2m_min <dbl> -1.3, -0.4, 0.7, 2.8, 0.7, 2.5, 6.0, 1.2, 0…
## $ apparent_temperature_max <dbl> 6.3, 10.7, 9.7, 9.1, 9.7, 12.0, 12.0, 6.6, …
## $ apparent_temperature_min <dbl> -4.9, -3.6, -2.8, -0.8, -3.2, -0.9, 3.0, -2…
## $ precipitation_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ rain_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0…
## $ sunrise <dttm> 2010-01-01 07:51:00, 2010-01-02 07:52:00, …
## $ sunset <dttm> 2010-01-01 17:36:00, 2010-01-02 17:37:00, …
## $ windspeed_10m_max <dbl> 8.4, 5.4, 9.7, 8.2, 6.4, 6.3, 13.7, 9.7, 5.…
## $ windgusts_10m_max <dbl> 18.0, 16.6, 24.1, 22.0, 18.7, 16.6, 32.0, 2…
## $ winddirection_10m_dominant <int> 327, 332, 341, 338, 317, 311, 3, 3, 317, 34…
## $ shortwave_radiation_sum <dbl> 10.98, 11.85, 11.72, 11.04, 11.88, 11.74, 1…
## $ et0_fao_evapotranspiration <dbl> 1.55, 1.62, 1.91, 1.84, 1.78, 1.80, 2.22, 1…
## $ sunrise_chr <chr> "2010-01-01T07:51", "2010-01-02T07:52", "20…
## $ sunset_chr <chr> "2010-01-01T17:36", "2010-01-02T17:37", "20…
## $ fct_winddir <fct> 327, 332, 341, 338, 317, 311, 3, 3, 317, 34…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 53, 1, 51, 3, 3, 1, 1, 2, 61, 3, 2, 3, 1, 3…
## $ temperature_2m_max <dbl> 26.6, 18.0, 16.7, 15.5, 14.9, 13.8, 16.6, 2…
## $ temperature_2m_min <dbl> 17.5, 11.6, 11.3, 9.0, 9.4, 6.3, 8.6, 11.6,…
## $ apparent_temperature_max <dbl> 25.7, 13.8, 13.6, 10.8, 10.8, 9.2, 14.3, 20…
## $ apparent_temperature_min <dbl> 14.5, 6.9, 6.6, 4.0, 5.0, 0.3, 3.9, 10.2, 1…
## $ precipitation_sum <dbl> 2.3, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 6.6…
## $ rain_sum <dbl> 2.3, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 6.6…
## $ snowfall_sum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours <dbl> 6, 0, 4, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, …
## $ sunrise <dttm> 2010-01-01 08:07:00, 2010-01-02 08:07:00, …
## $ sunset <dttm> 2010-01-01 18:41:00, 2010-01-02 18:42:00, …
## $ windspeed_10m_max <dbl> 31.3, 29.2, 22.6, 21.0, 23.7, 27.4, 18.6, 1…
## $ windgusts_10m_max <dbl> 50.4, 46.8, 32.4, 36.7, 41.0, 42.1, 29.5, 2…
## $ winddirection_10m_dominant <int> 263, 350, 343, 324, 321, 329, 345, 247, 334…
## $ shortwave_radiation_sum <dbl> 11.06, 15.60, 12.41, 16.22, 14.96, 16.69, 1…
## $ et0_fao_evapotranspiration <dbl> 2.80, 3.50, 3.54, 3.76, 2.96, 3.13, 3.27, 2…
## $ sunrise_chr <chr> "2010-01-01T08:07", "2010-01-02T08:07", "20…
## $ sunset_chr <chr> "2010-01-01T18:41", "2010-01-02T18:42", "20…
## $ fct_winddir <fct> 263, 350, 343, 324, 321, 329, 345, 247, 334…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 4,914
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 73, 71, 71, 1, 1, 2, 2, 71, 0, 0, 2, 2, 3, …
## $ temperature_2m_max <dbl> 5.0, -0.6, -4.8, -0.8, -0.2, 1.1, 3.6, 1.9,…
## $ temperature_2m_min <dbl> -1.4, -9.2, -10.0, -7.3, -7.3, -5.3, -3.7, …
## $ apparent_temperature_max <dbl> 2.2, -6.0, -12.5, -6.6, -6.0, -5.0, -1.2, -…
## $ apparent_temperature_min <dbl> -5.6, -16.9, -17.6, -13.5, -12.4, -10.8, -9…
## $ precipitation_sum <dbl> 1.8, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7, 0.0…
## $ rain_sum <dbl> 0.3, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum <dbl> 1.05, 0.70, 0.28, 0.00, 0.00, 0.00, 0.00, 0…
## $ precipitation_hours <dbl> 5, 4, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0…
## $ sunrise <dttm> 2010-01-01 08:19:00, 2010-01-02 08:19:00, …
## $ sunset <dttm> 2010-01-01 17:39:00, 2010-01-02 17:40:00, …
## $ windspeed_10m_max <dbl> 11.0, 27.2, 35.3, 20.5, 17.8, 20.2, 18.1, 2…
## $ windgusts_10m_max <dbl> 18.4, 53.3, 61.2, 43.6, 34.9, 39.2, 32.8, 3…
## $ winddirection_10m_dominant <int> 297, 307, 297, 301, 296, 301, 295, 292, 320…
## $ shortwave_radiation_sum <dbl> 5.34, 6.81, 5.64, 8.88, 9.33, 8.95, 9.53, 5…
## $ et0_fao_evapotranspiration <dbl> 0.56, 1.06, 1.09, 1.13, 1.02, 1.18, 1.11, 0…
## $ sunrise_chr <chr> "2010-01-01T08:19", "2010-01-02T08:19", "20…
## $ sunset_chr <chr> "2010-01-01T17:39", "2010-01-02T17:40", "20…
## $ fct_winddir <fct> 297, 307, 297, 301, 296, 301, 295, 292, 320…
allCityDaily %>%
mutate(year=year(date)) %>%
group_by(cityName) %>%
summarize(n=n(), minYear=min(year), maxYear=max(year), minDate=min(date), maxDate=max(date))
## # A tibble: 7 × 6
## cityName n minYear maxYear minDate maxDate
## <chr> <int> <dbl> <dbl> <date> <date>
## 1 Boston MA 5113 2010 2023 2010-01-01 2023-12-31
## 2 Chicago IL 23376 1960 2023 1960-01-01 2023-12-31
## 3 Houston TX 5113 2010 2023 2010-01-01 2023-12-31
## 4 Las Vegas NV 5113 2010 2023 2010-01-01 2023-12-31
## 5 Los Angeles CA 5113 2010 2023 2010-01-01 2023-12-31
## 6 Miami FL 5113 2010 2023 2010-01-01 2023-12-31
## 7 New York NY 4914 2010 2023 2010-01-01 2023-06-15
Hourly data is updated with city name to match daily data:
hourlyCityMap <- c("Boston"="Boston MA",
"Chicago"="Chicago IL",
"Houston"="Houston TX",
"LA"="Los Angeles CA",
"Vegas"="Las Vegas NV",
"Miami"="Miami FL",
"NYC"="New York NY"
)
allCityHourly <- allCityHourly %>%
mutate(cityName=hourlyCityMap[src])
allCityHourly %>%
group_by(src, cityName) %>%
summarize(n=n(), minDate=min(date), maxDate=max(date), .groups="drop")
## # A tibble: 7 × 5
## src cityName n minDate maxDate
## <chr> <chr> <int> <date> <date>
## 1 Boston Boston MA 122712 2010-01-01 2023-12-31
## 2 Chicago Chicago IL 122712 2010-01-01 2023-12-31
## 3 Houston Houston TX 122712 2010-01-01 2023-12-31
## 4 LA Los Angeles CA 122712 2010-01-01 2023-12-31
## 5 Miami Miami FL 122712 2010-01-01 2023-12-31
## 6 NYC New York NY 117936 2010-01-01 2023-06-15
## 7 Vegas Las Vegas NV 122712 2010-01-01 2023-12-31
Alignment of maximum temperature is explored:
maxTDelta <- allCityHourly %>%
group_by(cityName, date) %>%
summarize(maxT=max(temperature_2m), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(allCityDaily, cityName, date, temperature_2m_max), by=c("cityName", "date")) %>%
mutate(delta=maxT-temperature_2m_max)
maxTDelta %>%
summary()
## cityName date maxT temperature_2m_max
## Length:33236 Min. :2010-01-01 Min. :-22.50 Min. :-22.80
## Class :character 1st Qu.:2013-04-01 1st Qu.: 14.70 1st Qu.: 14.70
## Mode :character Median :2016-07-01 Median : 23.60 Median : 23.60
## Mean :2016-07-01 Mean : 21.34 Mean : 21.33
## 3rd Qu.:2019-10-01 3rd Qu.: 28.80 3rd Qu.: 28.80
## Max. :2022-12-31 Max. : 45.90 Max. : 45.90
## delta
## Min. :-2.700000
## 1st Qu.: 0.000000
## Median : 0.000000
## Mean : 0.005277
## 3rd Qu.: 0.000000
## Max. : 5.900000
maxTDelta %>%
group_by(cityName) %>%
summarize(pctDiff=mean(abs(delta)>0.01))
## # A tibble: 7 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Chicago IL 0.104
## 3 Houston TX 0
## 4 Las Vegas NV 0
## 5 Los Angeles CA 0
## 6 Miami FL 0
## 7 New York NY 0
10% of observations in Chicago differ, potentially due to time zone conversions. All other cities align
The differences in Chicago observations are further explored:
maxTDelta %>%
filter(cityName=="Chicago IL") %>%
mutate(mon=month(date), year=year(date), isDiff=abs(delta)>0.01) %>%
group_by(year, mon) %>%
summarize(pctDiff=mean(isDiff), n=n(), .groups="drop") %>%
ggplot(aes(x=factor(month.abb[mon], levels=month.abb), y=pctDiff, group=1)) +
geom_line() +
geom_hline(data=~summarize(., yint=sum(n*pctDiff)/sum(n)), aes(yintercept=yint), lty=2) +
facet_wrap(~year) +
labs(x=NULL, y="Proportion mismatched", title="Mismatches between daily and hourly maximum temperature")
In general, mismatches seem to be more common in the colder season
A sample of the larger mismatches are explored:
tmpDates <- maxTDelta %>%
filter(cityName=="Chicago IL") %>%
mutate(mon=month(date), year=year(date), isDiff=abs(delta)>2.5) %>%
filter(isDiff) %>%
pull(date)
tmpDates
## [1] "2013-01-30" "2013-04-17" "2013-12-05" "2016-04-04" "2017-12-05"
## [6] "2018-05-20" "2019-01-20" "2020-12-24" "2022-02-17"
tmpDF <- allCityHourly %>%
filter(cityName=="Chicago IL") %>%
select(cityName, time, date, hour, temperature_2m)
map_dfr(.x=tmpDates,
.f=function(x) tmpDF %>% filter(date %in% c(x, x-1, x+1)),
.id="src"
) %>%
mutate(src=tmpDates[as.integer(src)], isSame=(src==date)) %>%
left_join(select(maxTDelta, cityName, src=date, temperature_2m_max), by=c("cityName", "src")) %>%
ggplot(aes(x=time, y=temperature_2m)) +
geom_line() +
geom_line(aes(y=temperature_2m_max), lty=2) +
facet_wrap(~src, scales="free") +
labs(x=NULL, y="Temperature (C)", title="Hourly temperatures near key disconnect dates")
Disconnects appear to arise when the maximum temperature is reached very early and/or very late in the day. This is consistent with time zones potentially driving the differences.
Shading is added for clarity:
map_dfr(.x=tmpDates,
.f=function(x) tmpDF %>% filter(date %in% c(x, x-1, x+1)),
.id="src"
) %>%
mutate(src=tmpDates[as.integer(src)], isSame=(src==date)) %>%
left_join(select(maxTDelta, cityName, src=date, temperature_2m_max), by=c("cityName", "src")) %>%
ggplot(aes(x=time, y=temperature_2m)) +
geom_line() +
geom_line(aes(y=temperature_2m_max), lty=2) +
facet_wrap(~src, scales="free") +
labs(x=NULL,
y="Temperature (C)",
title="Hourly temperatures near key disconnect dates",
subtitle="Line is hourly data, shaded region is the 24 hours in the key date",
caption="Dotted line is maximum temperature reported in daily data"
) +
geom_tile(data=~filter(., isSame), fill="lightblue", aes(x=time, y=0, height=+Inf), alpha=0.5)
Chicago data are shifted by an hour to check whether time zones fully explain the differences:
# Run previously
# tmpDF <- allCityHourly %>%
# filter(cityName=="Chicago IL") %>%
# select(cityName, time, date, hour, temperature_2m)
tmpDF %>%
mutate(newtime=time-hours(1), date=date(newtime)) %>%
filter(!(date %in% c(min(date)))) %>%
group_by(cityName, date) %>%
summarize(maxT=max(temperature_2m), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(allCityDaily, cityName, date, temperature_2m_max), by=c("cityName", "date")) %>%
mutate(delta=maxT-temperature_2m_max) %>%
summary()
## cityName date maxT temperature_2m_max
## Length:4748 Min. :2010-01-01 Min. :-22.80 Min. :-22.80
## Class :character 1st Qu.:2013-04-01 1st Qu.: 4.90 1st Qu.: 4.90
## Mode :character Median :2016-07-01 Median : 14.90 Median : 14.90
## Mean :2016-07-01 Mean : 14.28 Mean : 14.28
## 3rd Qu.:2019-10-01 3rd Qu.: 24.20 3rd Qu.: 24.20
## Max. :2022-12-31 Max. : 37.40 Max. : 37.40
## delta
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
If date is recalculated based on time minus one hour, the Chicago hourly data aligns with the Chicago daily data for maximum temperature
Alignment of precipitation is explored, first with time zones in hourly data “as is”:
sumPDelta <- allCityHourly %>%
group_by(cityName, date) %>%
summarize(sumP=sum(precipitation), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(allCityDaily, cityName, date, precipitation_sum), by=c("cityName", "date")) %>%
mutate(delta=sumP-precipitation_sum)
sumPDelta %>%
summary()
## cityName date sumP precipitation_sum
## Length:33236 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.575 Mean : 2.575
## 3rd Qu.:2019-10-01 3rd Qu.: 1.500 3rd Qu.: 1.600
## Max. :2022-12-31 Max. :143.900 Max. :143.900
## delta
## Min. :-14.7
## 1st Qu.: 0.0
## Median : 0.0
## Mean : 0.0
## 3rd Qu.: 0.0
## Max. : 14.7
sumPDelta %>%
group_by(cityName) %>%
summarize(pctDiff=mean(abs(delta)>0.005))
## # A tibble: 7 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Chicago IL 0.259
## 3 Houston TX 0
## 4 Las Vegas NV 0
## 5 Los Angeles CA 0
## 6 Miami FL 0
## 7 New York NY 0
25% of observations in Chicago differ, likely due to time zone conversions. All other cities align
Chicago data is updated to be one hour earlier:
sumPDelta <- allCityHourly %>%
mutate(date=as.Date(ifelse(cityName %in% c("Chicago IL"), time-hours(1), date))) %>%
group_by(cityName, date) %>%
summarize(sumP=sum(precipitation), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(allCityDaily, cityName, date, precipitation_sum), by=c("cityName", "date")) %>%
mutate(delta=sumP-precipitation_sum)
sumPDelta %>%
summary()
## cityName date sumP precipitation_sum
## Length:28488 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.499 Mean : 2.499
## 3rd Qu.:2019-10-01 3rd Qu.: 1.400 3rd Qu.: 1.400
## Max. :2022-12-31 Max. :143.900 Max. :143.900
## delta
## Min. :-1.421e-14
## 1st Qu.: 0.000e+00
## Median : 0.000e+00
## Mean : 9.273e-18
## 3rd Qu.: 0.000e+00
## Max. : 7.105e-15
sumPDelta %>%
group_by(cityName) %>%
summarize(pctDiff=mean(abs(delta)>0.005))
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
After adjusting for time zone in Chicago, all observations align
The process is generalized in a function that adjusts only Chicago and with other significant hard-coding:
tmpDailyVsHourly <- function(df1=allCityHourly,
df2=allCityDaily,
chgCities=c("Chicago IL"),
chgHours=-1,
varDF1="precipitation",
varDF2="precipitation_sum",
fn=sum,
eqTol=0.005
) {
df <- df1 %>%
mutate(date=as.Date(ifelse(cityName %in% chgCities, time+hours(chgHours), date))) %>%
group_by(cityName, date) %>%
summarize(across(.cols=all_of(varDF1), .fns=fn, .names="agg"), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(df2, all_of(c("cityName", "date", varDF2))),
by=c("cityName", "date")
) %>%
mutate(delta=agg-get(varDF2))
df %>%
summary() %>%
print()
df %>%
group_by(cityName) %>%
summarize(pctDiff=mean(abs(delta)>eqTol)) %>%
print()
}
tmpDailyVsHourly(varDF1="precipitation", varDF2="precipitation_sum", fn=sum, eqTol=0.005)
## cityName date agg precipitation_sum
## Length:28488 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.499 Mean : 2.499
## 3rd Qu.:2019-10-01 3rd Qu.: 1.400 3rd Qu.: 1.400
## Max. :2022-12-31 Max. :143.900 Max. :143.900
## delta
## Min. :-1.421e-14
## 1st Qu.: 0.000e+00
## Median : 0.000e+00
## Mean : 9.273e-18
## 3rd Qu.: 0.000e+00
## Max. : 7.105e-15
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
The function is tested for maximum temperature:
# No change in time
tmpDailyVsHourly(varDF1="temperature_2m", varDF2="temperature_2m_max", fn=max, eqTol=0.005, chgCities=c())
## cityName date agg temperature_2m_max
## Length:33236 Min. :2010-01-01 Min. :-22.50 Min. :-22.80
## Class :character 1st Qu.:2013-04-01 1st Qu.: 14.70 1st Qu.: 14.70
## Mode :character Median :2016-07-01 Median : 23.60 Median : 23.60
## Mean :2016-07-01 Mean : 21.34 Mean : 21.33
## 3rd Qu.:2019-10-01 3rd Qu.: 28.80 3rd Qu.: 28.80
## Max. :2022-12-31 Max. : 45.90 Max. : 45.90
## delta
## Min. :-2.700000
## 1st Qu.: 0.000000
## Median : 0.000000
## Mean : 0.005277
## 3rd Qu.: 0.000000
## Max. : 5.900000
## # A tibble: 7 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Chicago IL 0.104
## 3 Houston TX 0
## 4 Las Vegas NV 0
## 5 Los Angeles CA 0
## 6 Miami FL 0
## 7 New York NY 0
# Chicago shifted by -1 hours (default)
tmpDailyVsHourly(varDF1="temperature_2m", varDF2="temperature_2m_max", fn=max, eqTol=0.005)
## cityName date agg temperature_2m_max
## Length:28488 Min. :2010-01-01 Min. :-15.40 Min. :-15.40
## Class :character 1st Qu.:2013-04-01 1st Qu.: 16.50 1st Qu.: 16.50
## Mode :character Median :2016-07-01 Median : 24.60 Median : 24.60
## Mean :2016-07-01 Mean : 22.51 Mean : 22.51
## 3rd Qu.:2019-10-01 3rd Qu.: 29.30 3rd Qu.: 29.30
## Max. :2022-12-31 Max. : 45.90 Max. : 45.90
## delta
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
The function is tested for maximum and minimum apparent temperature:
# Maximum apparent temperature, with Chicago time shift
tmpDailyVsHourly(varDF1="apparent_temperature", varDF2="apparent_temperature_max", fn=max, eqTol=0.005)
## cityName date agg
## Length:28488 Min. :2010-01-01 Min. :-23.60
## Class :character 1st Qu.:2013-04-01 1st Qu.: 13.90
## Mode :character Median :2016-07-01 Median : 24.30
## Mean :2016-07-01 Mean : 22.04
## 3rd Qu.:2019-10-01 3rd Qu.: 31.40
## Max. :2022-12-31 Max. : 45.60
## apparent_temperature_max delta
## Min. :-23.60 Min. :0
## 1st Qu.: 13.90 1st Qu.:0
## Median : 24.30 Median :0
## Mean : 22.04 Mean :0
## 3rd Qu.: 31.40 3rd Qu.:0
## Max. : 45.60 Max. :0
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
# Minimum apparent temperature, with Chicago time shift
tmpDailyVsHourly(varDF1="apparent_temperature", varDF2="apparent_temperature_min", fn=min, eqTol=0.005)
## cityName date agg
## Length:28488 Min. :2010-01-01 Min. :-33.10
## Class :character 1st Qu.:2013-04-01 1st Qu.: 3.30
## Mode :character Median :2016-07-01 Median : 12.90
## Mean :2016-07-01 Mean : 12.12
## 3rd Qu.:2019-10-01 3rd Qu.: 22.00
## Max. :2022-12-31 Max. : 33.20
## apparent_temperature_min delta
## Min. :-33.10 Min. :0
## 1st Qu.: 3.30 1st Qu.:0
## Median : 12.90 Median :0
## Mean : 12.12 Mean :0
## 3rd Qu.: 22.00 3rd Qu.:0
## Max. : 33.20 Max. :0
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
The function is tested for sum of precipitation and snowfall:
# Precipitation, with Chicago time shift
tmpDailyVsHourly(varDF1="precipitation", varDF2="precipitation_sum", fn=sum, eqTol=0.005)
## cityName date agg precipitation_sum
## Length:28488 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.499 Mean : 2.499
## 3rd Qu.:2019-10-01 3rd Qu.: 1.400 3rd Qu.: 1.400
## Max. :2022-12-31 Max. :143.900 Max. :143.900
## delta
## Min. :-1.421e-14
## 1st Qu.: 0.000e+00
## Median : 0.000e+00
## Mean : 9.273e-18
## 3rd Qu.: 0.000e+00
## Max. : 7.105e-15
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
# Snowfall, with Chicago time shift
tmpDailyVsHourly(varDF1="snowfall", varDF2="snowfall_sum", fn=sum, eqTol=0.005)
## cityName date agg snowfall_sum
## Length:28488 Min. :2010-01-01 Min. : 0.00000 Min. : 0.00000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.00000 1st Qu.: 0.00000
## Mode :character Median :2016-07-01 Median : 0.00000 Median : 0.00000
## Mean :2016-07-01 Mean : 0.08226 Mean : 0.08226
## 3rd Qu.:2019-10-01 3rd Qu.: 0.00000 3rd Qu.: 0.00000
## Max. :2022-12-31 Max. :32.83000 Max. :32.83000
## delta
## Min. :-3.553e-15
## 1st Qu.: 0.000e+00
## Median : 0.000e+00
## Mean : 1.270e-18
## 3rd Qu.: 0.000e+00
## Max. : 3.553e-15
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
The function is tested for hours of precipitation:
# Precipitation hours, with Chicago time shift
tmpDailyVsHourly(varDF1="precipitation", varDF2="precipitation_hours", fn=function(x) sum(x>0), eqTol=0.005)
## cityName date agg precipitation_hours
## Length:28488 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.992 Mean : 2.992
## 3rd Qu.:2019-10-01 3rd Qu.: 4.000 3rd Qu.: 4.000
## Max. :2022-12-31 Max. :24.000 Max. :24.000
## delta
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
A random forest model is created to estimate dewpoint based on temperature and apparent temperature given city and month:
# Use only 10% as training data
set.seed(25091014)
idxTrain_v1 <- sample(1:nrow(allCityHourly), round(0.1*nrow(allCityHourly)), replace=FALSE)
rfTestDewP <- runFullRF(dfTrain=allCityHourly[idxTrain_v1, ] %>%
mutate(fct_city=factor(cityName),
fct_mon=factor(month.abb[month(date)], levels=month.abb)
) %>%
select(dewpoint_2m, temperature_2m, apparent_temperature, fct_city, fct_mon),
dfTest=allCityHourly[-idxTrain_v1, ] %>%
mutate(fct_city=factor(cityName),
fct_mon=factor(month.abb[month(date)], levels=month.abb)
),
yVar="dewpoint_2m",
xVars=c("temperature_2m", "apparent_temperature", "fct_city", "fct_mon"),
isContVar=TRUE,
rndTo=-1L,
returnData=TRUE
)
##
## R-squared of test data is: 93.31% (RMSE 2.84 vs. 10.98 null)
## `geom_smooth()` using formula = 'y ~ x'
rfTestDewP$tstPred %>%
group_by(fct_city, fct_mon) %>%
summarize(n=n(),
tssActual=sum((dewpoint_2m-mean(dewpoint_2m))**2),
rssPred=sum((dewpoint_2m-pred)**2),
r2=1-rssPred/tssActual,
rmse=sqrt(rssPred/n),
.groups="drop"
) %>%
ggplot(aes(x=fct_city, y=fct_mon)) +
geom_tile(aes(fill=r2)) +
geom_text(aes(label=round(r2, 3)), size=3) +
scale_fill_continuous("R2", low="white", high="lightgreen", limits=c(0, 1)) +
labs(x=NULL, y=NULL, title="R2 for dewpoint predictions")
Actual daily dewpoints are calculated using hourly data:
dewPDaily <- allCityHourly %>%
mutate(fct_city=factor(cityName),
fct_mon=factor(month.abb[month(date)], levels=month.abb)
) %>%
group_by(fct_city, fct_mon, date) %>%
summarize(across(dewpoint_2m, .fns=list(mean=mean, min=min, max=max)), .groups="drop")
dewPDaily
## # A tibble: 35,592 × 6
## fct_city fct_mon date dewpoint_2m_mean dewpoint_2m_min dewpoint_2m_max
## <fct> <fct> <date> <dbl> <dbl> <dbl>
## 1 Boston MA Jan 2010-01-01 -3.48 -5.1 -1.5
## 2 Boston MA Jan 2010-01-02 -4.96 -10 -0.6
## 3 Boston MA Jan 2010-01-03 -9.23 -12.8 -4.7
## 4 Boston MA Jan 2010-01-04 -6.86 -8.6 -4.7
## 5 Boston MA Jan 2010-01-05 -8.16 -9.4 -6.5
## 6 Boston MA Jan 2010-01-06 -8.53 -9.3 -7.4
## 7 Boston MA Jan 2010-01-07 -7.86 -9.2 -5.5
## 8 Boston MA Jan 2010-01-08 -8.65 -10.9 -7.5
## 9 Boston MA Jan 2010-01-09 -13.5 -17 -11.1
## 10 Boston MA Jan 2010-01-10 -16.4 -18 -14.4
## # ℹ 35,582 more rows
Annual average dewpoints by city are explored:
dewPDaily %>%
mutate(year=year(date)) %>%
filter(year>=2010, year<2023) %>%
group_by(year, fct_city) %>%
summarize(mudp=mean(dewpoint_2m_mean), .groups="drop") %>%
ggplot(aes(x=factor(year), y=mudp)) +
geom_line(aes(group=fct_city, color=fct_city), lwd=1) +
labs(title="Average dewpoint by city and year", y="Average dewpoint (C)", x=NULL) +
scale_color_discrete(NULL)
Monthly average dewpoints by city are explored:
dewPDaily %>%
mutate(year=year(date)) %>%
filter(year>=2010, year<2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(mudp=mean(dewpoint_2m_mean),
maxdp=max(dewpoint_2m_mean),
mindp=min(dewpoint_2m_mean),
.groups="drop"
) %>%
ggplot(aes(x=fct_mon)) +
geom_line(aes(y=mudp, group=fct_city, color=fct_city), lwd=1) +
labs(title="Average dewpoint by city and month", y="Average dewpoint (C)", x=NULL) +
scale_color_discrete(NULL)
Rolling 21-day average dewpoints by city are explored:
dewPDaily %>%
mutate(year=year(date), doy=pmin(365, yday(date))) %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="dewpoint_2m_mean", newName="dewpoint_r21", k=21) %>%
ungroup() %>%
filter(year>=2010, year<2023, !is.na(dewpoint_r21)) %>%
group_by(fct_city, doy) %>%
summarize(mudp=mean(dewpoint_r21),
maxdp=max(dewpoint_r21),
mindp=min(dewpoint_r21),
.groups="drop"
) %>%
ggplot(aes(x=doy)) +
geom_line(aes(y=mudp, group=fct_city, color=fct_city), lwd=1) +
labs(title="Average dewpoint by city and day of year",
subtitle="Rolling 21-day",
y="Average rolling 21-day dewpoint (C)",
x="Day of year"
) +
scale_color_discrete(NULL)
Dew points are added to the daily data:
allCityDaily_v2 <-
allCityDaily %>%
left_join(dewPDaily %>%
mutate(cityName=as.character(fct_city)) %>%
select(cityName, date, starts_with("dewp")),
by=c("cityName", "date")
)
allCityDaily_v2 %>%
group_by(cityName) %>%
summarize(n=n(), hasDew=sum(!is.na(dewpoint_2m_mean)))
## # A tibble: 7 × 3
## cityName n hasDew
## <chr> <int> <int>
## 1 Boston MA 5113 5113
## 2 Chicago IL 23376 5113
## 3 Houston TX 5113 5113
## 4 Las Vegas NV 5113 5113
## 5 Los Angeles CA 5113 5113
## 6 Miami FL 5113 5113
## 7 New York NY 4914 4914
Relationships among temperature, dewpoint, and apparent temperature are explored:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
mutate(across(where(is.numeric), .fns=round)) %>%
count(cityName, muTemp, muApp) %>%
ggplot(aes(x=muTemp, y=muApp)) +
geom_point(aes(size=n), alpha=0.5)+
facet_wrap(~cityName) +
labs(title="July apparent temperature vs. temperature",
x="Average daily temperature (C)",
y="Average daily\napparent temperature (C)"
) +
scale_size_continuous(NULL)
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
mutate(across(where(is.numeric), .fns=round)) %>%
count(cityName, muDew, muApp) %>%
ggplot(aes(x=muDew, y=muApp)) +
geom_point(aes(size=n), alpha=0.5)+
facet_wrap(~cityName) +
labs(title="July apparent temperature vs. dewpoint",
x="Average daily dewpoint (C)",
y="Average daily\napparent temperature (C)"
) +
scale_size_continuous(NULL)
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
mutate(across(where(is.numeric), .fns=round)) %>%
count(cityName, muDew, muTemp) %>%
ggplot(aes(x=muDew, y=muTemp)) +
geom_point(aes(size=n), alpha=0.5)+
facet_wrap(~cityName) +
labs(title="July temperature vs. dewpoint",
x="Average daily dewpoint (C)",
y="Average daily\ntemperature (C)"
) +
scale_size_continuous(NULL)
Relationships among temperature, dewpoint, and apparent temperature are further explored:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
mutate(across(where(is.numeric), .fns=round),
appr5=autoRound(muApp, rndTo=5L)
) %>%
count(cityName, muDew, muTemp, appr5) %>%
ggplot(aes(x=muDew, y=muTemp)) +
geom_point(aes(size=n, color=factor(appr5)), alpha=0.5)+
facet_wrap(~cityName) +
labs(title="July temperature vs. dewpoint",
x="Average daily dewpoint (C)",
y="Average daily\ntemperature (C)"
) +
scale_size_continuous(NULL) +
scale_color_discrete("App Temp\n(nearest 5 C)")
A simple linear regression is explored:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1693 -1.3825 -0.1217 1.9902 4.6627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.40918 0.29867 4.718 2.5e-06 ***
## muTemp 1.02750 0.01116 92.101 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.362 on 2819 degrees of freedom
## Multiple R-squared: 0.7506, Adjusted R-squared: 0.7505
## F-statistic: 8483 on 1 and 2819 DF, p-value: < 2.2e-16
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muDew, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muDew, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.865 -3.368 0.385 3.460 12.594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.10439 0.23536 102.42 <2e-16 ***
## muDew 0.26039 0.01273 20.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.413 on 2819 degrees of freedom
## Multiple R-squared: 0.1293, Adjusted R-squared: 0.129
## F-statistic: 418.7 on 1 and 2819 DF, p-value: < 2.2e-16
On average, temperature is better than dewpoint as a predictor of apparent temperature. Interaction is also explored:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp + muDew, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp + muDew, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5790 -0.5599 0.0916 0.6567 3.2703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.110891 0.126564 -48.28 <2e-16 ***
## muTemp 1.091163 0.004236 257.58 <2e-16 ***
## muDew 0.337196 0.002586 130.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.891 on 2818 degrees of freedom
## Multiple R-squared: 0.9645, Adjusted R-squared: 0.9645
## F-statistic: 3.831e+04 on 2 and 2818 DF, p-value: < 2.2e-16
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp + muDew + muTemp:muDew, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp + muDew + muTemp:muDew, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5912 -0.5465 0.0883 0.6354 3.2415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.0232152 0.3884503 -10.357 < 2e-16 ***
## muTemp 1.0234331 0.0126449 80.936 < 2e-16 ***
## muDew 0.1842179 0.0270512 6.810 1.19e-11 ***
## muTemp:muDew 0.0051692 0.0009099 5.681 1.48e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8861 on 2817 degrees of freedom
## Multiple R-squared: 0.9649, Adjusted R-squared: 0.9649
## F-statistic: 2.583e+04 on 3 and 2817 DF, p-value: < 2.2e-16
The inclusion of both temperature and dewpoint meaningfully decreases residual standard error. The interaction term between temperature and dewpoint, while statistically significant, has little practical impact. City is added as an additional explanatory variable:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp + muDew + cityName + 0, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp + muDew + cityName + 0, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.90625 -0.42940 0.06619 0.46037 2.76072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## muTemp 1.035693 0.005850 177.03 <2e-16 ***
## muDew 0.301535 0.004165 72.39 <2e-16 ***
## cityNameBoston MA -4.705775 0.145150 -32.42 <2e-16 ***
## cityNameChicago IL -4.732561 0.146445 -32.32 <2e-16 ***
## cityNameHouston TX -3.211754 0.177352 -18.11 <2e-16 ***
## cityNameLas Vegas NV -4.206984 0.188979 -22.26 <2e-16 ***
## cityNameLos Angeles CA -3.749403 0.140676 -26.65 <2e-16 ***
## cityNameMiami FL -3.267359 0.174773 -18.70 <2e-16 ***
## cityNameNew York NY -4.303946 0.154345 -27.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7383 on 2812 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 4.832e+05 on 9 and 2812 DF, p-value: < 2.2e-16
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp:cityName + muDew:cityName + cityName + 0, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp:cityName + muDew:cityName + cityName +
## 0, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.08814 -0.39255 0.06184 0.42885 1.90562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## cityNameBoston MA -4.607592 0.285746 -16.125 < 2e-16 ***
## cityNameChicago IL -6.294447 0.250438 -25.134 < 2e-16 ***
## cityNameHouston TX -6.352102 0.886444 -7.166 9.84e-13 ***
## cityNameLas Vegas NV -0.360121 0.447277 -0.805 0.421
## cityNameLos Angeles CA -2.990764 0.372639 -8.026 1.47e-15 ***
## cityNameMiami FL -16.050511 1.563984 -10.263 < 2e-16 ***
## cityNameNew York NY -6.136167 0.349298 -17.567 < 2e-16 ***
## muTemp:cityNameBoston MA 0.984007 0.013408 73.392 < 2e-16 ***
## muTemp:cityNameChicago IL 1.063621 0.016338 65.100 < 2e-16 ***
## muTemp:cityNameHouston TX 1.008152 0.022276 45.258 < 2e-16 ***
## muTemp:cityNameLas Vegas NV 0.929141 0.013618 68.228 < 2e-16 ***
## muTemp:cityNameLos Angeles CA 0.993172 0.012892 77.041 < 2e-16 ***
## muTemp:cityNameMiami FL 1.119919 0.036174 30.960 < 2e-16 ***
## muTemp:cityNameNew York NY 1.043039 0.015807 65.984 < 2e-16 ***
## cityNameBoston MA:muDew 0.365493 0.012838 28.470 < 2e-16 ***
## cityNameChicago IL:muDew 0.350936 0.016710 21.001 < 2e-16 ***
## cityNameHouston TX:muDew 0.469955 0.023034 20.403 < 2e-16 ***
## cityNameLas Vegas NV:muDew 0.239456 0.005067 47.254 < 2e-16 ***
## cityNameLos Angeles CA:muDew 0.319501 0.014467 22.085 < 2e-16 ***
## cityNameMiami FL:muDew 0.741580 0.050220 14.767 < 2e-16 ***
## cityNameNew York NY:muDew 0.389827 0.012843 30.352 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6826 on 2800 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 2.423e+05 on 21 and 2800 DF, p-value: < 2.2e-16
Including city reduces residual error slightly
Dewpoints by month are further explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0)
) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=mean(gte68), lte32=mean(lte32), .groups="drop") %>%
pivot_longer(cols=-c(fct_city, fct_mon)) %>%
ggplot(aes(x=fct_mon, y=value)) +
geom_line(aes(group=name, color=c("gte68"=">=68F", "lte32"="<=32F")[name])) +
facet_wrap(~fct_city) +
labs(x=NULL, y="Proportion of days", title="Proportion of days by month by dewpoint range") +
scale_color_discrete("Dewpoint")
Dewpoints at or above 20C (68F) by year are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
arrange(fct_city, date) %>%
group_by(fct_city, year) %>%
mutate(gte68=cumsum(gte68), lte32=cumsum(lte32)) %>%
ungroup() %>%
ggplot(aes(x=doy, y=gte68)) +
geom_line(aes(group=factor(year), color=factor(year))) +
facet_wrap(~fct_city, scales="free_y") +
labs(x=NULL,
y="Days with dewpoint >= 20C (68F)",
title="Cumulative days by year with dewpoint >= 20C (68F)"
) +
scale_color_discrete(NULL)
Dewpoints at or below 0C (32F) by year are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
arrange(fct_city, date) %>%
group_by(fct_city, year) %>%
mutate(gte68=cumsum(gte68), lte32=cumsum(lte32)) %>%
ungroup() %>%
ggplot(aes(x=doy, y=lte32)) +
geom_line(aes(group=factor(year), color=factor(year))) +
facet_wrap(~fct_city, scales="free_y") +
labs(x=NULL,
y="Days with dewpoint <= 0C (32F)",
title="Cumulative days by year with dewpoint <= 0C (32F)"
) +
scale_color_discrete(NULL)
Dewpoints at or above 20C (68F) by month are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=sum(gte68), lte32=sum(lte32), .groups="drop") %>%
ggplot(aes(x=fct_mon, y=gte68)) +
geom_col(fill="lightblue") +
geom_text(aes(label=gte68), size=2.5, vjust=0) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="# Days (2010-2022)",
title="Days with dewpoint >= 20C (68F)\n2010-2022"
) +
scale_color_discrete(NULL)
Dewpoints at or below 0C (32F) by month are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=sum(gte68), lte32=sum(lte32), .groups="drop") %>%
ggplot(aes(x=fct_mon, y=lte32)) +
geom_col(fill="lightblue") +
geom_text(aes(label=lte32), size=2.5, vjust=0) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="# Days (2010-2022)",
title="Days with dewpoint <= 0C (32F)\n2010-2022"
) +
scale_color_discrete(NULL)
Proportion of days with dewpoints at or above 20C (68F) by month are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=mean(gte68), lte32=mean(lte32), .groups="drop") %>%
ggplot(aes(x=fct_mon, y=gte68)) +
geom_col(fill="lightblue") +
geom_text(aes(label=round(gte68,2)), size=2.5, vjust=0) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="Proportion of Days (2010-2022)",
title="Days with dewpoint >= 20C (68F)\n2010-2022"
) +
scale_color_discrete(NULL)
Proportion of days with dewpoints at or below 0C (32F) by month are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=mean(gte68), lte32=mean(lte32), .groups="drop") %>%
ggplot(aes(x=fct_mon, y=lte32)) +
geom_col(fill="lightblue") +
geom_text(aes(label=round(lte32,2)), size=2.5, vjust=0) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="Proportion of Days (2010-2022)",
title="Days with dewpoint <= 0C (32F)\n2010-2022"
) +
scale_color_discrete(NULL)
ACF is calculated for dew point for a single city:
dewPDaily %>%
filter(fct_city=="Chicago IL") %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
acf(lag.max=1000, main="ACF for daily dewpoint - Chicago IL (2010-2022)")
ACF is calculated for dew point for each city:
dewCities <- dewPDaily %>%
pull(fct_city) %>%
as.vector() %>%
unique()
dewACF <- lapply(dewCities,
FUN=function(x) {
dewPDaily %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
acf(lag.max=2000, plot=FALSE)
}
)
names(dewACF) <- dewCities
ACF are combined into a single tibble:
dewACFAll <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(dewACF[[x]][["lag"]]),
acf=as.vector(dewACF[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
dewACFAll
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.882 Boston MA
## 3 1 2 0.778 Boston MA
## 4 1 3 0.754 Boston MA
## 5 1 4 0.750 Boston MA
## 6 1 5 0.749 Boston MA
## 7 1 6 0.750 Boston MA
## 8 1 7 0.749 Boston MA
## 9 1 8 0.741 Boston MA
## 10 1 9 0.731 Boston MA
## # ℹ 13,997 more rows
dewACFAll %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_acf=mean(abs(acf))) %>%
arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Chicago IL 0.388
## 2 Boston MA 0.377
## 3 New York NY 0.372
## 4 Houston TX 0.280
## 5 Miami FL 0.250
## 6 Los Angeles CA 0.222
## 7 Las Vegas NV 0.138
ACF by city are plotted:
dewACFAll %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="ACF", title="ACF of dewpoint by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
PACF is calculated for dew point for a single city:
dewPDaily %>%
filter(fct_city=="Chicago IL") %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
pacf(lag.max=1000, main="PACF for daily dewpoint - Chicago IL (2010-2022)")
dewPDaily %>%
filter(fct_city=="Chicago IL") %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
pacf(lag.max=20, main="PACF for daily dewpoint - Chicago IL (2010-2022)")
PACF is calculated for dew point for each city:
dewPACF <- lapply(dewCities,
FUN=function(x) {
dewPDaily %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
pacf(lag.max=2000, plot=FALSE)
}
)
names(dewPACF) <- dewCities
PACF are combined into a single tibble:
dewPACFAll <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(dewPACF[[x]][["lag"]]),
pacf=as.vector(dewPACF[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
dewPACFAll
## # A tibble: 14,000 × 4
## src lag pacf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 1 0.882 Boston MA
## 2 1 2 -0.00177 Boston MA
## 3 1 3 0.306 Boston MA
## 4 1 4 0.127 Boston MA
## 5 1 5 0.158 Boston MA
## 6 1 6 0.126 Boston MA
## 7 1 7 0.103 Boston MA
## 8 1 8 0.0677 Boston MA
## 9 1 9 0.0515 Boston MA
## 10 1 10 0.0707 Boston MA
## # ℹ 13,990 more rows
dewPACFAll %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_pacf=mean(abs(pacf))) %>%
arrange(desc(mu_abs_pacf))
## # A tibble: 7 × 2
## cityName mu_abs_pacf
## <chr> <dbl>
## 1 Boston MA 0.0114
## 2 New York NY 0.0113
## 3 Chicago IL 0.0112
## 4 Houston TX 0.0110
## 5 Los Angeles CA 0.0109
## 6 Miami FL 0.0108
## 7 Las Vegas NV 0.0104
PACF by city are plotted:
dewPACFAll %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=pacf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="PACF", title="PACF of dewpoint by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1)) +
scale_x_log10()
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 19.8 0.0629 20.3 4.50
## 2 Los Angeles CA 6.85 0.0953 46.4 6.81
## 3 Las Vegas NV -1.59 0.102 53.0 7.28
## 4 Houston TX 15.8 0.109 61.1 7.82
## 5 New York NY 6.56 0.144 102. 10.1
## 6 Boston MA 5.47 0.143 104. 10.2
## 7 Chicago IL 5.43 0.150 115. 10.7
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.85 2.20 0.872 19.8
## 2 Los Angeles CA 10.9 3.30 0.875 6.85
## 3 Las Vegas NV 12.8 3.58 0.871 -1.59
## 4 Chicago IL 16.8 4.10 0.924 5.42
## 5 Houston TX 17.7 4.20 0.843 15.8
## 6 New York NY 21.1 4.60 0.891 6.56
## 7 Boston MA 23.0 4.80 0.882 5.47
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 8.38 2.90 0.796 19.8
## 2 Los Angeles CA 19.2 4.38 0.794 6.85
## 3 Las Vegas NV 22.5 4.75 0.776 -1.59
## 4 Houston TX 26.1 5.11 0.786 15.8
## 5 New York NY 42.9 6.55 0.782 6.56
## 6 Chicago IL 43.6 6.60 0.821 5.42
## 7 Boston MA 44.4 6.66 0.776 5.47
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.58 2.14 0.799 0.315 19.8
## 2 Los Angeles CA 10.3 3.21 0.804 0.307 6.85
## 3 Las Vegas NV 12.2 3.50 0.807 0.269 -1.59
## 4 Houston TX 16.3 4.04 0.733 0.394 15.8
## 5 Chicago IL 16.6 4.08 0.902 0.151 5.43
## 6 New York NY 21.1 4.60 0.878 0.0594 6.56
## 7 Boston MA 23.0 4.80 0.881 0.00734 5.47
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.68 2.16 1.04 -0.187 19.8
## 2 Los Angeles CA 10.5 3.24 1.03 -0.181 6.85
## 3 Las Vegas NV 12.4 3.52 1.03 -0.183 -1.59
## 4 Chicago IL 16.7 4.09 0.989 -0.0703 5.43
## 5 Houston TX 17.0 4.12 1.00 -0.191 15.8
## 6 New York NY 21.1 4.60 0.908 -0.0194 6.56
## 7 Boston MA 23.0 4.80 0.884 -0.00206 5.47
The auto.arima() process is run to estimate optimal parameters for a single city:
dewPDaily %>%
filter(fct_city=="Chicago IL") %>%
arrange(date) %>%
pull(dewpoint_2m_mean) %>%
forecast::auto.arima()
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Series: .
## ARIMA(5,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 mean
## 0.9676 -0.3158 0.1975 -0.0276 0.1317 5.4251
## s.e. 0.0139 0.0194 0.0197 0.0194 0.0139 1.1627
##
## sigma^2 = 15.17: log likelihood = -14205.69
## AIC=28425.37 AICc=28425.39 BIC=28471.15
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
forecast::auto.arima()
)
names(tstARIMA) <- dewCities
The optimal model by city is pulled:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t()
## ar ma d sig2
## Boston MA 5 0 0 20.01
## Chicago IL 5 0 0 15.17
## Houston TX 5 0 0 15.04
## Las Vegas NV 2 3 0 11.83
## Los Angeles CA 2 2 0 9.71
## Miami FL 2 4 1 4.26
## New York NY 5 0 0 18.56
Sigma-squared is also explored:
cat("\n\nOptimal Parameters (ARIMA for daily mean dewpoint by city):\n")
##
##
## Optimal Parameters (ARIMA for daily mean dewpoint by city):
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
full_join(dewPDaily %>% group_by(fct_city) %>% summarize(varOrig=var(dewpoint_2m_mean)),
by=c("cityName"="fct_city")
) %>%
mutate(pct=1-sig2/varOrig)
## cityName ar ma d sig2 varOrig pct
## 1 Boston MA 5 0 0 20.01 103.86622 0.8073483
## 2 Chicago IL 5 0 0 15.17 115.11075 0.8682139
## 3 Houston TX 5 0 0 15.04 61.13661 0.7539935
## 4 Las Vegas NV 2 3 0 11.83 53.01145 0.7768406
## 5 Los Angeles CA 2 2 0 9.71 46.41365 0.7907943
## 6 Miami FL 2 4 1 4.26 20.26281 0.7897626
## 7 New York NY 5 0 0 18.56 102.44193 0.8188242
Rolling 21-day dewpoints by city are plotted:
dewPDaily %>%
mutate(doy=yday(date)) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="dewpoint_2m_mean", newName="dewpoint_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(dewpoint_r21=mean(dewpoint_r21, na.rm=TRUE), .groups="drop") %>%
ggplot(aes(x=doy)) +
geom_line(aes(y=dewpoint_r21), lwd=2) +
geom_point(data=dewPDaily %>% mutate(doy=yday(date), year=year(date)) %>% arrange(date),
aes(y=dewpoint_2m_mean, group=year),
alpha=0.25,
size=0.1
) +
facet_wrap(~fct_city) +
labs(x="Day of year",
y="Dewpoint (C)",
title="Dewpoint by day of year\n(actual and rolling 21-day average)"
)
Daily dewpoint data are converted to deviation from long-term rolling 21-day mean by day of year:
dewPDaily_vs_r21 <- dewPDaily %>%
mutate(doy=pmin(yday(date), 365), year=year(date)) %>%
left_join(dewPDaily %>%
mutate(doy=pmin(yday(date),365)) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="dewpoint_2m_mean", newName="dewpoint_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(dewpoint_r21=mean(dewpoint_r21, na.rm=TRUE), .groups="drop"),
by=c("fct_city", "doy")
) %>%
mutate(delta_r21=dewpoint_2m_mean-dewpoint_r21)
dewPDaily_vs_r21
## # A tibble: 35,592 × 10
## fct_city fct_mon date dewpoint_2m_mean dewpoint_2m_min dewpoint_2m_max
## <fct> <fct> <date> <dbl> <dbl> <dbl>
## 1 Boston MA Jan 2010-01-01 -3.48 -5.1 -1.5
## 2 Boston MA Jan 2010-01-02 -4.96 -10 -0.6
## 3 Boston MA Jan 2010-01-03 -9.23 -12.8 -4.7
## 4 Boston MA Jan 2010-01-04 -6.86 -8.6 -4.7
## 5 Boston MA Jan 2010-01-05 -8.16 -9.4 -6.5
## 6 Boston MA Jan 2010-01-06 -8.53 -9.3 -7.4
## 7 Boston MA Jan 2010-01-07 -7.86 -9.2 -5.5
## 8 Boston MA Jan 2010-01-08 -8.65 -10.9 -7.5
## 9 Boston MA Jan 2010-01-09 -13.5 -17 -11.1
## 10 Boston MA Jan 2010-01-10 -16.4 -18 -14.4
## # ℹ 35,582 more rows
## # ℹ 4 more variables: doy <dbl>, year <dbl>, dewpoint_r21 <dbl>,
## # delta_r21 <dbl>
summary(dewPDaily_vs_r21)
## fct_city fct_mon date dewpoint_2m_mean
## Boston MA :5113 Jan : 3038 Min. :2010-01-01 Min. :-35.76667
## Chicago IL :5113 Mar : 3038 1st Qu.:2013-06-25 1st Qu.: -0.01771
## Houston TX :5113 May : 3038 Median :2016-12-17 Median : 9.47500
## Las Vegas NV :5113 Jul : 3007 Mean :2016-12-17 Mean : 8.34512
## Los Angeles CA:5113 Aug : 3007 3rd Qu.:2020-06-10 3rd Qu.: 17.53333
## Miami FL :5113 Oct : 3007 Max. :2023-12-31 Max. : 25.94167
## New York NY :4914 (Other):17457
## dewpoint_2m_min dewpoint_2m_max doy year
## Min. :-38.600 Min. :-32.00 Min. : 1.0 Min. :2010
## 1st Qu.: -3.500 1st Qu.: 3.80 1st Qu.: 91.0 1st Qu.:2013
## Median : 6.300 Median : 12.50 Median :182.0 Median :2016
## Mean : 5.499 Mean : 11.28 Mean :182.6 Mean :2016
## 3rd Qu.: 15.200 3rd Qu.: 19.90 3rd Qu.:274.0 3rd Qu.:2020
## Max. : 25.200 Max. : 27.60 Max. :365.0 Max. :2023
##
## dewpoint_r21 delta_r21
## Min. :-8.98471 Min. :-28.33807
## 1st Qu.:-0.04931 1st Qu.: -2.77771
## Median : 8.78258 Median : 0.33513
## Mean : 8.34958 Mean : -0.00446
## 3rd Qu.:16.58411 3rd Qu.: 3.06098
## Max. :23.96929 Max. : 19.74022
##
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL -0.0220 0.0441 9.95 3.15
## 2 Los Angeles CA 0.0118 0.0701 25.1 5.01
## 3 Houston TX -0.0141 0.0729 27.2 5.21
## 4 Boston MA 0.00479 0.0733 27.5 5.24
## 5 New York NY -0.0177 0.0750 27.6 5.26
## 6 Chicago IL 0.00493 0.0740 28.0 5.29
## 7 Las Vegas NV 0.000614 0.0816 34.1 5.84
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.52 2.13 0.739 -0.0221
## 2 Los Angeles CA 10.3 3.21 0.768 0.0118
## 3 Las Vegas NV 12.3 3.51 0.799 0.000632
## 4 Chicago IL 14.7 3.84 0.688 0.00495
## 5 Houston TX 15.8 3.97 0.648 -0.0141
## 6 New York NY 17.8 4.22 0.596 -0.0177
## 7 Boston MA 19.0 4.36 0.556 0.00478
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 5.08 2.25 0.724 -0.0219
## 2 Los Angeles CA 12.3 3.51 0.734 0.0117
## 3 Houston TX 15.1 3.88 0.706 -0.0141
## 4 Chicago IL 15.8 3.98 0.678 0.00488
## 5 Las Vegas NV 16.4 4.05 0.731 0.000566
## 6 New York NY 17.7 4.21 0.622 -0.0176
## 7 Boston MA 18.4 4.29 0.606 0.00477
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.11 2.03 0.572 0.406 -0.0222
## 2 Los Angeles CA 9.45 3.07 0.622 0.385 0.0118
## 3 Las Vegas NV 11.6 3.40 0.691 0.318 0.000754
## 4 Houston TX 13.7 3.70 0.397 0.512 -0.0141
## 5 Chicago IL 13.8 3.72 0.503 0.371 0.00503
## 6 New York NY 16.8 4.10 0.353 0.397 -0.0178
## 7 Boston MA 17.9 4.23 0.276 0.429 0.00474
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.17 2.04 0.945 -0.279 -0.0219
## 2 Los Angeles CA 9.62 3.10 0.963 -0.253 0.0118
## 3 Las Vegas NV 11.7 3.41 0.982 -0.229 0.000716
## 4 Chicago IL 14.0 3.75 0.839 -0.219 0.00523
## 5 Houston TX 14.0 3.75 0.863 -0.332 -0.0139
## 6 New York NY 17.1 4.13 0.719 -0.207 -0.0176
## 7 Boston MA 18.1 4.26 0.673 -0.211 0.00475
Sigma-squared for a single city is compared across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
sapply(pdqList, FUN=function(pdq) lapply("Boston MA", FUN=function(x) { tmp<- dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq)); tmp$sigma2}
)
) %>%
as_tibble() %>%
t()
## [,1]
## c(0, 0, 0) 27.48617
## c(1, 0, 0) 18.98769
## c(0, 0, 1) 18.41728
## c(1, 0, 1) 17.87572
## c(2, 0, 0) 18.14050
Sigma-squared for all cities is compiled across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities,
FUN=function(x) {
tmp<- dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq))
tmp$sigma2
}
)
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("ARIMA") %>%
as_tibble()
pdqAll
## # A tibble: 5 × 8
## ARIMA `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 c(0, 0,… 27.5 28.0 27.2 34.0 25.1
## 2 c(1, 0,… 19.0 14.7 15.8 12.3 10.3
## 3 c(0, 0,… 18.4 15.8 15.1 16.4 12.3
## 4 c(1, 0,… 17.9 13.8 13.7 11.6 9.45
## 5 c(2, 0,… 18.1 14.0 14.0 11.7 9.62
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>
Sigma-squared for all cities is plotted:
pdqAll %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily dewpoint vs. long-term 21-day mean for day of year)"
)
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
forecast::auto.arima(d=0)
)
names(tstARIMA) <- dewCities
The auto.arima() parameters are pulled for all cities:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t()
## ar ma d sig2
## Boston MA 1 3 0 17.82
## Chicago IL 3 2 0 13.79
## Houston TX 1 3 0 13.65
## Las Vegas NV 1 1 0 11.55
## Los Angeles CA 2 2 0 9.42
## Miami FL 5 5 0 4.10
## New York NY 2 3 0 16.72
Sigma-squared for all cities is updated:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily dewpoint vs. long-term 21-day mean for day of year)"
)
Sigma-squared for all cities is updated, incuding the baseline:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
bind_rows(dewPDaily %>%
group_by(fct_city) %>%
summarize(value=var(dewpoint_2m_mean)) %>%
mutate(ARIMA="base") %>%
pivot_wider(id_cols="ARIMA", names_from="fct_city")
) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily dewpoint vs. long-term 21-day mean for day of year)",
caption="base=daily dewpoint (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
)
Lag/lead for dewpoint vs. 21-day mean is plotted:
dewPDaily_vs_r21 %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
mutate(dewNext=lead(delta_r21)) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
count(fct_city, delta_r21=round(delta_r21), dewNext=round(dewNext)) %>%
ggplot(aes(x=delta_r21, y=dewNext)) +
geom_point(aes(size=n), alpha=0.1) +
facet_wrap(~fct_city) +
geom_smooth(aes(weight=n), method="lm") +
annotate("point", x=0, y=0, color="red") +
geom_abline(intercept=0, slope=1, color="red", lty=2) +
labs(x="Dewpoint vs. long-term 21-day mean",
y="Next day's dewpoint vs. long-term 21-day mean",
title="Dewpoint vs. long-term 21-day mean by day of year and city"
) +
scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'
Daily mean temperature data are converted to deviation from long-term rolling 21-day mean by day of year:
tempDaily_vs_r21 <- allCityDaily %>%
mutate(doy=pmin(yday(date), 365),
year=year(date),
meanTemp=0.5*(temperature_2m_max+temperature_2m_min),
fct_city=factor(cityName)
) %>%
select(fct_city, date, doy, year, meanTemp) %>%
left_join(allCityDaily %>%
mutate(doy=pmin(yday(date),365),
meanTemp=0.5*(temperature_2m_max+temperature_2m_min),
fct_city=factor(cityName)
) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="meanTemp", newName="temp_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(temp_r21=mean(temp_r21, na.rm=TRUE), .groups="drop"),
by=c("fct_city", "doy")
) %>%
mutate(delta_r21=meanTemp-temp_r21)
tempDaily_vs_r21
## # A tibble: 53,855 × 7
## fct_city date doy year meanTemp temp_r21 delta_r21
## <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boston MA 2010-01-01 1 2010 0.5 0.112 0.388
## 2 Boston MA 2010-01-02 2 2010 -2.55 0.0786 -2.63
## 3 Boston MA 2010-01-03 3 2010 -5.15 0.00934 -5.16
## 4 Boston MA 2010-01-04 4 2010 -2.5 -0.152 -2.35
## 5 Boston MA 2010-01-05 5 2010 -3.15 -0.338 -2.81
## 6 Boston MA 2010-01-06 6 2010 -3 -0.396 -2.60
## 7 Boston MA 2010-01-07 7 2010 -1.65 -0.486 -1.16
## 8 Boston MA 2010-01-08 8 2010 -4.6 -0.552 -4.05
## 9 Boston MA 2010-01-09 9 2010 -8.5 -0.567 -7.93
## 10 Boston MA 2010-01-10 10 2010 -9.05 -0.608 -8.44
## # ℹ 53,845 more rows
summary(tempDaily_vs_r21)
## fct_city date doy year
## Boston MA : 5113 Min. :1960-01-01 Min. : 1.0 Min. :1960
## Chicago IL :23376 1st Qu.:1996-11-10 1st Qu.: 91.0 1st Qu.:1996
## Houston TX : 5113 Median :2013-05-22 Median :183.0 Median :2013
## Las Vegas NV : 5113 Mean :2006-02-14 Mean :182.8 Mean :2006
## Los Angeles CA: 5113 3rd Qu.:2018-08-28 3rd Qu.:274.0 3rd Qu.:2018
## Miami FL : 5113 Max. :2023-12-31 Max. :365.0 Max. :2023
## New York NY : 4914
## meanTemp temp_r21 delta_r21
## Min. :-29.00 Min. :-5.275 Min. :-23.877046
## 1st Qu.: 6.75 1st Qu.: 6.956 1st Qu.: -2.371354
## Median : 16.40 Median :16.192 Median : 0.025680
## Mean : 14.53 Mean :14.532 Mean : -0.000837
## 3rd Qu.: 23.30 3rd Qu.:22.876 3rd Qu.: 2.402418
## Max. : 39.75 Max. :32.763 Max. : 18.633780
##
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL -0.0182 0.0293 4.40 2.10
## 2 Los Angeles CA 0.00593 0.0453 10.5 3.24
## 3 Las Vegas NV 0.00745 0.0473 11.4 3.38
## 4 Houston TX -0.0135 0.0517 13.7 3.69
## 5 New York NY -0.00916 0.0540 14.3 3.79
## 6 Boston MA 0.00000680 0.0556 15.8 3.97
## 7 Chicago IL 0.00400 0.0303 21.5 4.64
Lag/lead for temperature vs. 21-day mean is plotted:
tempDaily_vs_r21 %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
mutate(tempNext=lead(delta_r21)) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
count(fct_city, delta_r21=round(delta_r21), tempNext=round(tempNext)) %>%
ggplot(aes(x=delta_r21, y=tempNext)) +
geom_point(aes(size=n), alpha=0.1) +
facet_wrap(~fct_city) +
geom_smooth(aes(weight=n), method="lm") +
annotate("point", x=0, y=0, color="red") +
geom_abline(intercept=0, slope=1, color="red", lty=2) +
labs(x="Temperature vs. long-term 21-day mean",
y="Next day's temperature vs. long-term 21-day mean",
title="Temperature vs. long-term 21-day mean by day of year and city"
) +
scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 2.12 1.46 0.721 -0.0183
## 2 Los Angeles CA 3.31 1.82 0.827 0.00592
## 3 Las Vegas NV 4.31 2.08 0.789 0.00746
## 4 Houston TX 6.44 2.54 0.727 -0.0134
## 5 New York NY 7.88 2.81 0.671 -0.00917
## 6 Boston MA 9.17 3.03 0.648 0.00000623
## 7 Chicago IL 9.74 3.12 0.740 0.00400
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 2.38 1.54 0.696 -0.0181
## 2 Los Angeles CA 4.70 2.17 0.758 0.00586
## 3 Las Vegas NV 5.66 2.38 0.718 0.00737
## 4 Houston TX 6.95 2.64 0.725 -0.0135
## 5 New York NY 8.61 2.93 0.636 -0.00914
## 6 Boston MA 9.58 3.10 0.638 0.0000114
## 7 Chicago IL 11.3 3.36 0.708 0.00400
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 1.99 1.41 0.559 0.356 -0.0183
## 2 Los Angeles CA 2.96 1.72 0.728 0.366 0.00589
## 3 Las Vegas NV 4.06 2.01 0.685 0.298 0.00754
## 4 Houston TX 5.70 2.39 0.542 0.451 -0.0134
## 5 New York NY 7.61 2.76 0.511 0.297 -0.00923
## 6 Boston MA 8.70 2.95 0.449 0.358 -0.0000145
## 7 Chicago IL 9.21 3.03 0.589 0.346 0.00401
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 2.02 1.42 0.877 -0.217 -0.0181
## 2 Los Angeles CA 2.95 1.72 1.10 -0.330 0.00578
## 3 Las Vegas NV 4.06 2.02 0.978 -0.239 0.00747
## 4 Houston TX 5.82 2.41 0.952 -0.310 -0.0133
## 5 New York NY 7.68 2.77 0.779 -0.162 -0.00904
## 6 Boston MA 8.82 2.97 0.774 -0.195 0.0000453
## 7 Chicago IL 9.35 3.06 0.888 -0.200 0.00400
Sigma-squared for all cities is compiled across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities,
FUN=function(x) {
tmp<- tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq))
tmp$sigma2
}
)
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("ARIMA") %>%
as_tibble()
pdqAll
## # A tibble: 5 × 8
## ARIMA `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 c(0, 0,… 15.8 21.5 13.6 11.4 10.5
## 2 c(1, 0,… 9.16 9.74 6.44 4.31 3.31
## 3 c(0, 0,… 9.58 11.3 6.95 5.66 4.70
## 4 c(1, 0,… 8.70 9.21 5.69 4.06 2.96
## 5 c(2, 0,… 8.82 9.35 5.82 4.06 2.95
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>
Sigma-squared for all cities is plotted:
pdqAll %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily temperature vs. long-term 21-day mean for day of year)"
)
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
forecast::auto.arima(d=0)
)
names(tstARIMA) <- dewCities
The auto.arima() parameters are pulled for all cities:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t()
## ar ma d sig2
## Boston MA 3 3 0 8.63
## Chicago IL 2 4 0 9.12
## Houston TX 4 1 0 5.63
## Las Vegas NV 1 4 0 4.05
## Los Angeles CA 1 5 0 2.94
## Miami FL 2 3 0 1.97
## New York NY 2 3 0 7.51
Sigma-squared for all cities is updated:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily temperature vs. long-term 21-day mean for day of year)"
)
Sigma-squared for all cities is updated, incuding the baseline:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
bind_rows(tempDaily_vs_r21 %>%
group_by(fct_city) %>%
summarize(value=var(meanTemp)) %>%
mutate(ARIMA="base") %>%
pivot_wider(id_cols="ARIMA", names_from="fct_city")
) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily temperature vs. long-term 21-day mean for day of year)",
caption="base=daily temperature (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
)
Daily maximum wind speed data are converted to deviation from long-term rolling 21-day mean by day of year:
windDaily_vs_r21 <- allCityDaily %>%
mutate(doy=pmin(yday(date), 365),
year=year(date),
maxWind=windspeed_10m_max,
fct_city=factor(cityName)
) %>%
select(fct_city, date, doy, year, maxWind) %>%
left_join(allCityDaily %>%
mutate(doy=pmin(yday(date),365),
maxWind=windspeed_10m_max,
fct_city=factor(cityName)
) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="maxWind", newName="wind_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(wind_r21=mean(wind_r21, na.rm=TRUE), .groups="drop"),
by=c("fct_city", "doy")
) %>%
mutate(delta_r21=maxWind-wind_r21)
windDaily_vs_r21
## # A tibble: 53,855 × 7
## fct_city date doy year maxWind wind_r21 delta_r21
## <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boston MA 2010-01-01 1 2010 13.1 22.5 -9.35
## 2 Boston MA 2010-01-02 2 2010 34.7 22.6 12.1
## 3 Boston MA 2010-01-03 3 2010 34.8 22.7 12.1
## 4 Boston MA 2010-01-04 4 2010 28.1 22.6 5.50
## 5 Boston MA 2010-01-05 5 2010 17.1 22.5 -5.40
## 6 Boston MA 2010-01-06 6 2010 20.7 22.4 -1.69
## 7 Boston MA 2010-01-07 7 2010 21.1 22.3 -1.18
## 8 Boston MA 2010-01-08 8 2010 21.5 22.2 -0.738
## 9 Boston MA 2010-01-09 9 2010 22 22.3 -0.274
## 10 Boston MA 2010-01-10 10 2010 20 22.4 -2.41
## # ℹ 53,845 more rows
summary(windDaily_vs_r21)
## fct_city date doy year
## Boston MA : 5113 Min. :1960-01-01 Min. : 1.0 Min. :1960
## Chicago IL :23376 1st Qu.:1996-11-10 1st Qu.: 91.0 1st Qu.:1996
## Houston TX : 5113 Median :2013-05-22 Median :183.0 Median :2013
## Las Vegas NV : 5113 Mean :2006-02-14 Mean :182.8 Mean :2006
## Los Angeles CA: 5113 3rd Qu.:2018-08-28 3rd Qu.:274.0 3rd Qu.:2018
## Miami FL : 5113 Max. :2023-12-31 Max. :365.0 Max. :2023
## New York NY : 4914
## maxWind wind_r21 delta_r21
## Min. : 4.30 Min. :11.69 Min. :-18.59129
## 1st Qu.:14.30 1st Qu.:17.66 1st Qu.: -4.38622
## Median :18.90 Median :20.46 Median : -0.68503
## Mean :19.91 Mean :19.91 Mean : -0.00301
## 3rd Qu.:24.50 3rd Qu.:23.23 3rd Qu.: 3.60730
## Max. :97.00 Max. :25.18 Max. : 79.88061
##
Lag/lead for maximum wind speed vs. 21-day mean is plotted:
windDaily_vs_r21 %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
mutate(windNext=lead(delta_r21)) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
count(fct_city, delta_r21=round(delta_r21), windNext=round(windNext)) %>%
ggplot(aes(x=delta_r21, y=windNext)) +
geom_point(aes(size=n), alpha=0.1) +
facet_wrap(~fct_city) +
geom_smooth(aes(weight=n), method="lm") +
annotate("point", x=0, y=0, color="red") +
geom_abline(intercept=0, slope=1, color="red", lty=2) +
labs(x="Maximum wind speed vs. long-term 21-day mean",
y="Next day's maximum wind speed vs. long-term 21-day mean",
title="Maximum wind speed vs. long-term 21-day mean by day of year and city"
) +
scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA -0.0117 0.0555 15.8 3.97
## 2 New York NY 0.00404 0.0812 32.4 5.69
## 3 Miami FL 0.00896 0.0827 34.9 5.91
## 4 Houston TX -0.00981 0.0827 35.0 5.91
## 5 Boston MA -0.0122 0.0879 39.5 6.29
## 6 Chicago IL 0.000733 0.0454 48.2 6.95
## 7 Las Vegas NV -0.0141 0.111 62.7 7.92
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA 12.3 3.51 0.468 -0.0117
## 2 Miami FL 24.9 4.99 0.536 0.00885
## 3 New York NY 28.7 5.36 0.336 0.00409
## 4 Houston TX 28.9 5.38 0.416 -0.00984
## 5 Boston MA 35.7 5.98 0.311 -0.0122
## 6 Chicago IL 41.7 6.46 0.368 0.000734
## 7 Las Vegas NV 48.6 6.97 0.474 -0.0141
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA 12.8 3.57 0.419 -0.0118
## 2 Miami FL 25.9 5.09 0.499 0.00896
## 3 New York NY 28.2 5.31 0.382 0.00398
## 4 Houston TX 29.0 5.38 0.415 -0.00976
## 5 Boston MA 35.3 5.94 0.345 -0.0124
## 6 Chicago IL 41.4 6.44 0.383 0.000734
## 7 Las Vegas NV 50.4 7.10 0.426 -0.0142
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA 12.3 3.51 0.476 -0.0104 -0.0116
## 2 Miami FL 24.7 4.97 0.418 0.166 0.00853
## 3 New York NY 28.2 5.31 0.0118 0.372 0.00451
## 4 Houston TX 28.7 5.35 0.234 0.222 -0.00996
## 5 Boston MA 35.3 5.94 0.0187 0.328 -0.0122
## 6 Chicago IL 41.3 6.43 0.144 0.261 0.000744
## 7 Las Vegas NV 48.6 6.97 0.463 0.0145 -0.0140
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA 12.3 3.51 0.466 0.00332 -0.0117
## 2 Miami FL 24.7 4.97 0.583 -0.0893 0.00861
## 3 New York NY 28.3 5.32 0.378 -0.124 0.00395
## 4 Houston TX 28.7 5.36 0.452 -0.0869 -0.00971
## 5 Boston MA 35.3 5.94 0.344 -0.108 -0.0124
## 6 Chicago IL 41.4 6.43 0.401 -0.0896 0.000733
## 7 Las Vegas NV 48.6 6.97 0.476 -0.00481 -0.0140
Sigma-squared for all cities is compiled across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities,
FUN=function(x) {
tmp<- windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq))
tmp$sigma2
}
)
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("ARIMA") %>%
as_tibble()
pdqAll
## # A tibble: 5 × 8
## ARIMA `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 c(0, 0,… 39.5 48.2 35.0 62.6 15.7
## 2 c(1, 0,… 35.7 41.7 28.9 48.6 12.3
## 3 c(0, 0,… 35.3 41.4 29.0 50.4 12.8
## 4 c(1, 0,… 35.3 41.3 28.7 48.6 12.3
## 5 c(2, 0,… 35.3 41.4 28.7 48.6 12.3
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>
Sigma-squared for all cities is plotted:
pdqAll %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily maximum wind speed vs. long-term 21-day mean for day of year)"
)
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
forecast::auto.arima(d=0)
)
names(tstARIMA) <- dewCities